Introduction¶
The main objective of this project is to compare different supervised learning approaches and measure how each one affects the performance of a sentiment classification model.
I chose a dataset of X (Twitter) posts from Kaggle to capture modern, informal vocabulary.
This project includes:¶
- Analyzing and visualizing significant information from X posts.
- Preprocessing text data.
- Creating an AI model that predicts the sentiment of a given sentence.
- Comparison and reflection on each model variant.
#### Chapters:
- Dataset overview
- Processing text data
  1. Removing duplicates
  2. Balancing the classes
  3. Length of sentences
- Creating the Model
  1. Class weighting
  2. Text tokenizing
  3. Training the model
  4. Hyperparameter tuning
  5. Testing in practice
- Conclusion
I - Dataset overview¶
A first look at the character of the dataset, searching for any significant clues that could be used later on.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
df = pd.read_csv('tweets.csv', index_col=0)
df
| | Datetime | Tweet Id | Text | Username | sentiment | sentiment_score | emotion | emotion_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-09-30 23:29:15+00:00 | 1575991191170342912 | @Logitech @apple @Google @Microsoft @Dell @Len... | ManjuSreedaran | neutral | 0.853283 | anticipation | 0.587121 |
| 1 | 2022-09-30 21:46:35+00:00 | 1575965354425131008 | @MK_habit_addict @official_stier @MortalKombat... | MiKeMcDnet | neutral | 0.519470 | joy | 0.886913 |
| 2 | 2022-09-30 21:18:02+00:00 | 1575958171423752203 | As @CRN celebrates its 40th anniversary, Bob F... | jfollett | positive | 0.763791 | joy | 0.960347 |
| 3 | 2022-09-30 20:05:24+00:00 | 1575939891485032450 | @dell your customer service is horrible especi... | daveccarr | negative | 0.954023 | anger | 0.983203 |
| 4 | 2022-09-30 20:03:17+00:00 | 1575939359160750080 | @zacokalo @Dell @DellCares @Dell give the man ... | heycamella | neutral | 0.529170 | anger | 0.776124 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24965 | 2022-01-01 02:02:04+00:00 | 1477097760931336198 | @ElDarkAngel2 @GamersNexus @Dell I wouldn't ev... | Eodart | negative | 0.682981 | anger | 0.906309 |
| 24966 | 2022-01-01 01:57:34+00:00 | 1477096631300415496 | @kite_real @GamersNexus @Dell I didn't really ... | Eodart | positive | 0.743940 | joy | 0.951701 |
| 24967 | 2022-01-01 01:36:36+00:00 | 1477091355629432833 | Hey @JoshTheFixer here it is....27 4K UHD USB-... | Corleone250 | neutral | 0.654463 | anticipation | 0.471185 |
| 24968 | 2022-01-01 01:31:30+00:00 | 1477090070830141442 | @bravadogaming @thewolfpena @Alienware @intel ... | MrTwistyyy | neutral | 0.794049 | anticipation | 0.747014 |
| 24969 | 2022-01-01 00:59:37+00:00 | 1477082048900726784 | @rabia_ejaz @Dell Stopped buying windows lapto... | IDevourNehari | positive | 0.733861 | joy | 0.958346 |
24970 rows × 8 columns
Since we only want to predict the sentiment from the text data,
I decided to drop all unnecessary columns, leaving only the text and the sentiment label.
df.drop(['Datetime', 'Tweet Id', 'Username', 'sentiment_score', 'emotion', 'emotion_score'], axis=1, inplace=True)
df.rename(columns={'Text': 'text', 'sentiment': 'label'}, inplace=True)
# encoding class names as integers
labels = {
    'negative': 0,
    'neutral': 1,
    'positive': 2,
}
df['label'] = df['label'].map(labels)
df
| | text | label |
|---|---|---|
| 0 | @Logitech @apple @Google @Microsoft @Dell @Len... | 1 |
| 1 | @MK_habit_addict @official_stier @MortalKombat... | 1 |
| 2 | As @CRN celebrates its 40th anniversary, Bob F... | 2 |
| 3 | @dell your customer service is horrible especi... | 0 |
| 4 | @zacokalo @Dell @DellCares @Dell give the man ... | 1 |
| ... | ... | ... |
| 24965 | @ElDarkAngel2 @GamersNexus @Dell I wouldn't ev... | 0 |
| 24966 | @kite_real @GamersNexus @Dell I didn't really ... | 2 |
| 24967 | Hey @JoshTheFixer here it is....27 4K UHD USB-... | 1 |
| 24968 | @bravadogaming @thewolfpena @Alienware @intel ... | 1 |
| 24969 | @rabia_ejaz @Dell Stopped buying windows lapto... | 2 |
24970 rows × 2 columns
II - Processing text data¶
Step 1 - Removing duplicates¶
The main problems with duplicates:
1) Duplicated sentences skew the class distribution and can bias the model toward a label.
2) The same sentence may appear with two different labels,
and in that case we can't really tell which one is correct.
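The second problem can be checked directly. A minimal sketch (toy frame with the same 'text'/'label' columns as ours) that lists texts appearing under more than one label:

```python
import pandas as pd

# Toy data: the same text occurs with two different labels.
sample = pd.DataFrame({
    'text': ['great laptop', 'great laptop', 'meh'],
    'label': [2, 0, 1],
})
# Count distinct labels per text; anything above 1 is a conflict.
labels_per_text = sample.groupby('text')['label'].nunique()
conflicts = labels_per_text[labels_per_text > 1]
print(conflicts)  # 'great laptop' appears with 2 distinct labels
```

Dropping duplicates on the 'text' column, as done below, removes these conflicts along with the exact duplicates.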
df.isna().sum()
text     0
label    0
dtype: int64
df.duplicated().sum()
331
from copy import deepcopy
sns.set(style="whitegrid")
NAVY_LIGHT = '#4B527E'
NAVY_DARK = '#7C81AD'
# for better plot visualisation, keep a copy that still contains the duplicates
df_duplicates = deepcopy(df)
df.drop_duplicates(subset='text', keep='first', inplace=True)
fig = plt.figure()
fig = sns.countplot(df_duplicates, x='label', color='red', linewidth=0)
fig = sns.countplot(df, x='label', color=NAVY_DARK, linewidth=0)
plt.tight_layout()
plt.title('Duplicates per label')
plt.legend(['duplicates', 'unique'])
df.duplicated('text').sum()
0
Step 2 - Balancing the classes¶
The balance between the labels looks a little off, especially the first class, which stands out from the rest.
We downsample classes 0 and 2 to the size of the smallest class, 1.
Undersampling gets rid of the class-imbalance problem at the cost of having less data to train the model on.
df.value_counts('label')
label
0    10483
2     7167
1     6989
Name: count, dtype: int64
from sklearn.utils import resample
df_by_label = lambda i: df[df['label'] == i]
downsampled_df = pd.DataFrame(data=df_by_label(1))
max_size = len(df_by_label(1))
# downsampling classes 0 and 2 to the size of class 1
for i in [0, 2]:
    downsampled_class = resample(df_by_label(i),
                                 replace=False,  # sample without replacement so no duplicates are reintroduced
                                 n_samples=max_size,
                                 random_state=42)
    downsampled_df = pd.concat([downsampled_df, downsampled_class])
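As a side note, the same downsampling can be written more compactly with pandas alone; a toy sketch (hypothetical mini-frame) sampling each class without replacement down to the smallest class size:

```python
import pandas as pd

# Toy frame: class 0 has 5 rows, class 1 has 2, class 2 has 3.
df_demo = pd.DataFrame({
    'text': [f't{i}' for i in range(10)],
    'label': [0] * 5 + [1] * 2 + [2] * 3,
})
min_size = df_demo['label'].value_counts().min()
# Sample every class down to min_size rows, without replacement.
balanced = df_demo.groupby('label').sample(n=min_size, random_state=42)
print(balanced['label'].value_counts())  # every class: 2 rows
```

`GroupBy.sample` (pandas >= 1.1) avoids the explicit loop over classes.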
fig = plt.figure()
fig = sns.countplot(df, x='label', color=NAVY_LIGHT)
fig = sns.countplot(downsampled_df, x='label', color=NAVY_DARK, linewidth=0)
fig.axhline(y=max_size, color='r', linestyle=':')
fig.set_title('Balancing classes')
plt.tight_layout()
plt.show()
Step 3 - Length of sentences¶
By grouping sentences by their length in words, we can identify and exclude length groups that occur fewer than 100 times in the dataset.
In practice, we remove sentences shorter than 3 words or longer than 55 words to get an even more balanced dataset.
One interestingly odd detail is the big jump in the second plot for the negative class at a length of around 46 words.
Negative sentences seem to tend to be longer, but this clue is too weak to be taken seriously;
the safer conclusion is simply that the negative class contains more long sentences.
from matplotlib.lines import Line2D
from scipy.stats import hmean
# creating 3rd column that contains length in words of corresponding text data
df['length'] = df['text'].apply(lambda x: len(x.split()))
downsampled_df['length'] = downsampled_df['text'].apply(lambda x: len(x.split()))
df_by_label = lambda x: df[df['label'] == x]
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(20, 10))
sns.countplot(df, x='length', ax=axes[0])
colors = ['red', 'grey', 'green']
for i, color in enumerate(colors):
count_by_length = df_by_label(i)['length'].value_counts().sort_index()
sns.lineplot(x=count_by_length.index, y=count_by_length.values, marker=None, color=color, linewidth=3, ax=axes[1])
axes[0].set_title('Whole dataset')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Length (in words)')
axes[0].set_xlim(-1, 65)
axes[1].set_title('By class')
axes[1].set_ylabel('Count')
axes[1].set_xlabel('Length (in words)')
axes[1].set_xlim(1, 70)
legend_labels = ['negative', 'neutral', 'positive']
custom_lines = [Line2D([0], [0], color=colors[i], lw=3) for i in range(len(colors))]
axes[1].legend(custom_lines, legend_labels)
print(f"{'Mean':>20}: {int(df['length'].mean())}\n"
f"{'Harmonic mean':>20}: {int(hmean(df['length']))}")
fig.suptitle('Amount of sentences per length', fontsize=20)
plt.tight_layout()
plt.show()
Mean: 26
Harmonic mean: 16
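The harmonic mean sits well below the arithmetic mean because it is dominated by the shortest sentences; a quick self-contained illustration (toy numbers, not our data):

```python
import numpy as np
from scipy.stats import hmean

# Two very short and two long "sentences" (lengths in words).
lengths = np.array([2, 2, 50, 50])
print(np.mean(lengths))  # 26.0 - the arithmetic mean sits in the middle
print(hmean(lengths))    # ~3.85 - pulled strongly toward the short values
```

This is why the two statistics diverge on our dataset: a tail of very short tweets drags the harmonic mean down.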
# cutting off the sentences outside the 3-55 word range; technically the length column is not needed for this
df_cut = df[(df['length'] >= 3) & (df['length'] <= 55)]
downsampled_df = downsampled_df[(downsampled_df['length'] >= 3) & (downsampled_df['length'] <= 55)]
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(30, 5))
sns.countplot(df, x='length', color=NAVY_LIGHT)
sns.countplot(df_cut, x='length', color=NAVY_DARK)
plt.axhline(y=100, color='red', linestyle=':')
print(f"{'Mean':>20}: {int(df['length'].mean())}\n"
      f"{'Harmonic mean':>20}: {int(hmean(df['length']))}")
fig.suptitle('Amount of sentences per length', fontsize=20)
axes.set_xlabel('Length (in words)')
plt.show()
df = df_cut
Mean: 26
Harmonic mean: 16
III - Creating the Model¶
Step 1 - Class weighting¶
First, we prepare weights for our classes from the unbalanced dataset to avoid favoring the majority label.
weights = dict()
for i in range(3):
    # inverse class frequency, scaled by an empirically chosen constant
    weights[i] = len(df) / len(df[df['label'] == i]) * 6
weights
{0: 14.108060917644776, 1: 21.275599765944996, 2: 20.49894291754757}
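For comparison, scikit-learn's built-in 'balanced' heuristic, n_samples / (n_classes * n_class_count), produces weights proportional to ours (up to the constant factor); a small sketch with toy labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label vector: class 0 is twice as frequent as 1 and 2.
y_demo = np.array([0] * 100 + [1] * 50 + [2] * 50)
classes = np.unique(y_demo)
w = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
print(dict(zip(classes, w)))  # rarer classes get proportionally larger weights
```

`SGDClassifier` also accepts `class_weight='balanced'` directly, which computes the same weights internally.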
Step 2 - Text tokenizing¶
To give our model a clearer view of the text data, we tokenize it.
A custom transformer gives us extra options, such as removing emojis and mentions ('@username').
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from string import punctuation
import emoji
class TextTokenizer(BaseEstimator, TransformerMixin):
    def __init__(self, emoji=True, mentions=True):
        self.stemmer = SnowballStemmer('english')
        self.emoji = emoji
        self.mentions = mentions

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        processed_text = []
        for text in X:
            if self.emoji:
                # turn emojis into their text names, e.g. 'smiling_face' -> 'smiling face'
                text = emoji.demojize(text, delimiters=("", " "))
                text = text.replace("_", " ")
            if self.mentions:
                # drop @mentions entirely
                text = ' '.join(word for word in text.split() if word[0] != '@')
            # lowercase and strip punctuation, then stem every word
            text = ''.join(char.lower() for char in text if char not in punctuation)
            tokens = ' '.join(self.stemmer.stem(word) for word in text.split())
            processed_text.append(tokens)
        return processed_text
Testing functionality of our text tokenizer.
a = "😊😡 i don't care! #happy"
b = "smiling face emoji"
c = '@user23 thats #cRaZy!'
tokenizer = TextTokenizer()
for sentence in [a, b, c]:
    print(f"before: {sentence}\n after: {tokenizer.transform([sentence])}")
    print()
before: 😊😡 i don't care! #happy
 after: ['smile face with smile eye enrag face i dont care happi']
before: smiling face emoji
 after: ['smile face emoji']
before: @user23 thats #cRaZy!
 after: ['that crazi']
Step 3 - Training the model¶
Splitting both the unbalanced and the downsampled dataset into train and test sets.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer
# TRAIN TEST SPLIT FOR DEFAULT DATA
x = df.iloc[:, 0]
y = df.iloc[:, 1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# TRAIN TEST SPLIT FOR DOWNSAMPLED DATA
x_2 = downsampled_df.iloc[:, 0]
y_2 = downsampled_df.iloc[:, 1]
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(x_2, y_2, test_size=0.2, random_state=42)
Now we create four variants of the same model:
- sgd - Unbalanced (weighted)
- sgd_2 - Unbalanced (weighted) + custom tokenizer
- sgd_ds - Downsampled
- sgd_ds_2 - Downsampled + custom tokenizer
The 'Unbalanced' variants train on the unbalanced dataset, while the 'Downsampled' ones train on the balanced, downsampled dataset.
The pipeline below already uses the parameters found later during hyperparameter tuning (Step 4).
sgd_ds_2 = Pipeline([
    ('tok', TextTokenizer()),
    ('vec', CountVectorizer(ngram_range=(1, 2), stop_words=stopwords.words('english'))),
    ('clf', SGDClassifier(random_state=42, n_jobs=-1, shuffle=True))])
# default - unbalanced data, no custom transformer
sgd = deepcopy(sgd_ds_2)
sgd.set_params(clf__class_weight=weights)
sgd.steps.pop(0)  # drop the tokenizer step
# default - unbalanced data + custom transformer
sgd_2 = deepcopy(sgd_ds_2)
sgd_2.set_params(clf__class_weight=weights)
# downsampled, no custom transformer
sgd_ds = deepcopy(sgd_ds_2)
sgd_ds.steps.pop(0)  # drop the tokenizer step
Verifying that the parameters were assigned correctly.
sgd, sgd_2, sgd_ds, sgd_ds_2
(Pipeline(steps=[('vec',
CountVectorizer(ngram_range=(1, 2),
stop_words=['i', 'me', 'my', 'myself', 'we',
'our', 'ours', 'ourselves', 'you',
"you're", "you've", "you'll",
"you'd", 'your', 'yours',
'yourself', 'yourselves', 'he',
'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself',
'it', "it's", 'its', 'itself', ...])),
('clf',
SGDClassifier(class_weight={0: 14.108060917644776,
1: 21.275599765944996,
2: 20.49894291754757},
n_jobs=-1, random_state=42))]),
Pipeline(steps=[('tok', TextTokenizer()),
('vec',
CountVectorizer(ngram_range=(1, 2),
stop_words=['i', 'me', 'my', 'myself', 'we',
'our', 'ours', 'ourselves', 'you',
"you're", "you've", "you'll",
"you'd", 'your', 'yours',
'yourself', 'yourselves', 'he',
'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself',
'it', "it's", 'its', 'itself', ...])),
('clf',
SGDClassifier(class_weight={0: 14.108060917644776,
1: 21.275599765944996,
2: 20.49894291754757},
n_jobs=-1, random_state=42))]),
Pipeline(steps=[('vec',
CountVectorizer(ngram_range=(1, 2),
stop_words=['i', 'me', 'my', 'myself', 'we',
'our', 'ours', 'ourselves', 'you',
"you're", "you've", "you'll",
"you'd", 'your', 'yours',
'yourself', 'yourselves', 'he',
'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself',
'it', "it's", 'its', 'itself', ...])),
('clf', SGDClassifier(n_jobs=-1, random_state=42))]),
Pipeline(steps=[('tok', TextTokenizer()),
('vec',
CountVectorizer(ngram_range=(1, 2),
stop_words=['i', 'me', 'my', 'myself', 'we',
'our', 'ours', 'ourselves', 'you',
"you're", "you've", "you'll",
"you'd", 'your', 'yours',
'yourself', 'yourselves', 'he',
'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself',
'it', "it's", 'its', 'itself', ...])),
('clf', SGDClassifier(n_jobs=-1, random_state=42))]))
sgd.fit(x_train, y_train)
pred_default = sgd.predict(x_test)
sgd_2.fit(x_train, y_train)
pred_default_2 = sgd_2.predict(x_test)
sgd_ds.fit(x_train_2, y_train_2)
pred_ds = sgd_ds.predict(x_test_2)
sgd_ds_2.fit(x_train_2, y_train_2)
pred_ds_2 = sgd_ds_2.predict(x_test_2)
from sklearn.metrics import confusion_matrix, classification_report, f1_score
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
mat_default = confusion_matrix(y_test, pred_default)
mat_default_2 = confusion_matrix(y_test, pred_default_2)
mat_downsampled = confusion_matrix(y_test_2, pred_ds)
mat_downsampled_2 = confusion_matrix(y_test_2, pred_ds_2)
for i, mat in enumerate([mat_default, mat_default_2, mat_downsampled, mat_downsampled_2]):
    sns.heatmap(mat.T, annot=True, fmt='d', square=True, cbar=False, cmap='Blues', ax=axes.flatten()[i])
axes.flatten()[0].set_title('Unbalanced data, weighted')
axes.flatten()[1].set_title('Unbalanced data, weighted, custom tokenizer')
axes.flatten()[2].set_title('Downsampled data')
axes.flatten()[3].set_title('Downsampled data, custom tokenizer')
for i in range(4):
    axes.flatten()[i].set_xlabel('true label')
    axes.flatten()[i].set_ylabel('predicted label')
print(f"Unbalanced data, weighted:\n{classification_report(y_test, pred_default)}\n\n")
print(f"Unbalanced data, weighted, custom tokenizer:\n{classification_report(y_test, pred_default_2)}\n\n")
print(f"Downsampled data:\n{classification_report(y_test_2, pred_ds)}\n\n")
print(f"Downsampled data, custom tokenizer:\n{classification_report(y_test_2, pred_ds_2)}")
plt.tight_layout()
plt.show()
Unbalanced data, weighted:
precision recall f1-score support
0 0.81 0.82 0.81 2064
1 0.64 0.67 0.66 1374
2 0.79 0.74 0.76 1410
accuracy 0.75 4848
macro avg 0.74 0.74 0.74 4848
weighted avg 0.75 0.75 0.75 4848
Unbalanced data, weighted, custom tokenizer:
precision recall f1-score support
0 0.81 0.82 0.81 2064
1 0.64 0.67 0.66 1374
2 0.79 0.74 0.76 1410
accuracy 0.75 4848
macro avg 0.74 0.74 0.74 4848
weighted avg 0.75 0.75 0.75 4848
Downsampled data:
precision recall f1-score support
0 0.85 0.84 0.85 1390
1 0.76 0.78 0.77 1362
2 0.86 0.85 0.86 1369
accuracy 0.83 4121
macro avg 0.83 0.83 0.83 4121
weighted avg 0.83 0.83 0.83 4121
Downsampled data, custom tokenizer:
precision recall f1-score support
0 0.87 0.86 0.87 1390
1 0.77 0.81 0.79 1362
2 0.88 0.84 0.86 1369
accuracy 0.84 4121
macro avg 0.84 0.84 0.84 4121
weighted avg 0.84 0.84 0.84 4121
Step 4 - Hyperparameter tuning¶
Using grid search with cross-validation, we test all parameter combinations and keep the one that performs best.
For this part I chose the variant with balanced data and the custom tokenizer. It is quite important to find a good bias–variance tradeoff so the model also works in practice.
from sklearn.model_selection import GridSearchCV
parameters = {
    'tok__emoji': [True, False],
    'tok__mentions': [True, False],
    # 'vec__ngram_range': [(1, 1), (1, 2), (2, 3)],
    # 'clf__loss': ['hinge', 'log_loss', 'modified_huber', 'perceptron', 'huber', 'epsilon_insensitive'],
    # 'clf__penalty': ['elasticnet', 'l1', 'l2', None],
    # 'clf__learning_rate': ['constant', 'optimal', 'adaptive', 'invscaling'],
    # 'clf__shuffle': [True, False],
    # 'clf__alpha': np.linspace(0, 10, 5),
    # 'clf__epsilon': np.linspace(0, 10, 5),
    # 'clf__eta0': np.linspace(0, 1, 5),
}
gs_clf = GridSearchCV(sgd_ds_2, parameters, n_jobs=-1, verbose=1)
gs_clf = gs_clf.fit(x_train_2, y_train_2)
print(gs_clf.best_score_)
print(gs_clf.best_params_)
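Besides best_score_ and best_params_, the fitted search exposes the full cross-validation table in cv_results_; a self-contained toy sketch (stand-in classifier and grid, not our pipeline):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Toy 3-class problem as a stand-in for the tweet data.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4, random_state=42)
gs = GridSearchCV(SGDClassifier(random_state=42), {'alpha': [1e-4, 1e-2]}, cv=3)
gs.fit(X, y)

# Every tested combination with its mean CV score and rank.
results = pd.DataFrame(gs.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score'))
```

Looking at the full table (not just the winner) shows how close the runner-up combinations were, which helps judge whether the "best" parameters are a robust choice or noise.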
We repeat the earlier cells to rebuild the models with the new parameters and retrain them.
sgd_ds_2 = Pipeline([
    ('tok', TextTokenizer()),
    ('vec', CountVectorizer(ngram_range=(1, 2), stop_words=stopwords.words('english'))),
    ('clf', SGDClassifier(random_state=42, n_jobs=-1, penalty='elasticnet', shuffle=True,
                          loss='log_loss', learning_rate='adaptive', eta0=0.11, alpha=7e-05, epsilon=0.8)),
])
# default - unbalanced data, no custom transformer
sgd = deepcopy(sgd_ds_2)
sgd.set_params(clf__class_weight=weights)
sgd.steps.pop(0)  # drop the tokenizer step
# default - unbalanced data + custom transformer
sgd_2 = deepcopy(sgd_ds_2)
sgd_2.set_params(clf__class_weight=weights)
# downsampled, no custom transformer
sgd_ds = deepcopy(sgd_ds_2)
sgd_ds.steps.pop(0)  # drop the tokenizer step
sgd.fit(x_train, y_train)
pred_default = sgd.predict(x_test)
sgd_2.fit(x_train, y_train)
pred_default_2 = sgd_2.predict(x_test)
sgd_ds.fit(x_train_2, y_train_2)
pred_ds = sgd_ds.predict(x_test_2)
sgd_ds_2.fit(x_train_2, y_train_2)
pred_ds_2 = sgd_ds_2.predict(x_test_2)
from sklearn.metrics import confusion_matrix, classification_report, f1_score
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
mat_default = confusion_matrix(y_test, pred_default)
mat_default_2 = confusion_matrix(y_test, pred_default_2)
mat_downsampled = confusion_matrix(y_test_2, pred_ds)
mat_downsampled_2 = confusion_matrix(y_test_2, pred_ds_2)
for i, mat in enumerate([mat_default, mat_default_2, mat_downsampled, mat_downsampled_2]):
    sns.heatmap(mat.T, annot=True, fmt='d', square=True, cbar=False, cmap='Blues', ax=axes.flatten()[i])
axes.flatten()[0].set_title('Unbalanced data, weighted')
axes.flatten()[1].set_title('Unbalanced data, weighted, custom tokenizer')
axes.flatten()[2].set_title('Downsampled data')
axes.flatten()[3].set_title('Downsampled data, custom tokenizer')
for i in range(4):
    axes.flatten()[i].set_xlabel('true label')
    axes.flatten()[i].set_ylabel('predicted label')
print(f"Unbalanced data, weighted:\n{classification_report(y_test, pred_default)}\n\n")
print(f"Unbalanced data, weighted, custom tokenizer:\n{classification_report(y_test, pred_default_2)}\n\n")
print(f"Downsampled data:\n{classification_report(y_test_2, pred_ds)}\n\n")
print(f"Downsampled data, custom tokenizer:\n{classification_report(y_test_2, pred_ds_2)}")
plt.tight_layout()
plt.show()
Unbalanced data, weighted:
precision recall f1-score support
0 0.84 0.84 0.84 2064
1 0.67 0.69 0.68 1374
2 0.79 0.76 0.77 1410
accuracy 0.77 4848
macro avg 0.76 0.76 0.76 4848
weighted avg 0.77 0.77 0.77 4848
Unbalanced data, weighted, custom tokenizer:
precision recall f1-score support
0 0.84 0.84 0.84 2064
1 0.67 0.69 0.68 1374
2 0.79 0.76 0.77 1410
accuracy 0.77 4848
macro avg 0.76 0.76 0.76 4848
weighted avg 0.77 0.77 0.77 4848
Downsampled data:
precision recall f1-score support
0 0.85 0.85 0.85 1390
1 0.77 0.77 0.77 1362
2 0.86 0.86 0.86 1369
accuracy 0.83 4121
macro avg 0.83 0.83 0.83 4121
weighted avg 0.83 0.83 0.83 4121
Downsampled data, custom tokenizer:
precision recall f1-score support
0 0.87 0.85 0.86 1390
1 0.76 0.80 0.78 1362
2 0.87 0.84 0.86 1369
accuracy 0.83 4121
macro avg 0.83 0.83 0.83 4121
weighted avg 0.84 0.83 0.83 4121
Step 5 - Testing in practice¶
For the tests, I prepared 3 lists, one corresponding to each sentiment;
every list contains 10 sentences of ~5 words and 10 sentences of ~10 words.
negative_sentences = [
    # ~5 Words
    "I hate this so much. 😡",
    "Worst experience ever. 😞 #disappointed",
    "This is so frustrating. 😠",
    "Totally not worth it. 💸",
    "@user213 You ruined everything. 😤",
    "I can't stand this. 😒",
    "This is pure garbage. 😮",
    "Unbelievably bad service. 😤 #neveragain",
    "What a waste of time. 😞",
    "Seriously the worst ever. ❌",
    # ~10 Words
    "I can't believe how awful this turned out to be. 😡 #fail",
    "@user213 You really let me down, very disappointed in you.",
    "This is the worst product I've ever purchased. Refund, please! 💸",
    "Completely ruined my day, thanks for nothing. 😞 #neveragain",
    "Everything about this is just terrible, never using it again. ❌",
    "@user213 Your customer support is useless and unhelpful, so frustrating! 😤",
    "Honestly, I expected much better from you, this is trash. 😮",
    "Can't believe I wasted money on this, so regretful. 💸",
    "I am so upset right now, what a huge letdown. 😡",
    "Honestly, this entire experience has been nothing but a headache. 😠"
]
neutral_sentences = [
    # ~5 Words
    "It was okay, nothing special. 🤷‍♀️",
    "Not good, not bad either.",
    "Just an average experience today.",
    "Meh, it's alright I guess. 😐",
    "@user213 Could be better, honestly.",
    "This is neither here nor there.",
    "Just a regular day. 🤔",
    "I feel indifferent about it.",
    "Nothing to complain about. 🤷‍♀️",
    "It's fine, I suppose. 😶",
    # ~10 Words
    "I guess it's just fine, nothing really stood out. 🤷‍♀️",
    "Not amazing, but not terrible either, just kind of average.",
    "@user213 It's okay, not sure how I feel about it.",
    "This was pretty much what I expected, nothing surprising here.",
    "Honestly, I'm neither impressed nor disappointed, just neutral. 😐",
    "I don't really have a strong opinion on this one.",
    "It's fine, nothing to rave about or criticize. 🤷‍♀️",
    "Neither satisfied nor dissatisfied, just another average experience. 🤔",
    "@user213 It was pretty standard, nothing particularly great or bad.",
    "I'd call it a very average experience, to be honest. 😶"
]
positive_sentences = [
    # ~5 Words
    "Absolutely loved it! 😍 #amazing",
    "This made my day! 😊",
    "Fantastic job, @user213! 👏",
    "I'm so happy! 🥳 #blessed",
    "Worth every penny! 💰",
    "Super excited about this! 🎉",
    "Best decision ever made. 😊",
    "Love this so much! 😍",
    "So proud of you, @user213!",
    "Can't stop smiling! 😁",
    # ~10 Words
    "I'm incredibly happy with this, exceeded all my expectations! 😊",
    "Thank you, @user213, for such a fantastic experience! 🎉 #grateful",
    "This product has genuinely improved my life, super grateful! 😊",
    "I am over the moon with how this turned out. 😍",
    "Wow, just wow! Couldn't have asked for anything better. 🎉",
    "Amazing experience from start to finish, highly recommend! 😊 #bestdayever",
    "I'm so glad I tried this, totally worth it. 😊",
    "You nailed it, @user213! I'm really impressed! 👏",
    "I couldn't be happier with the results, totally satisfied. 😊",
    "This exceeded my expectations, truly a delightful surprise! 🎉"
]
score_df = {
    'Model': ['negative', 'neutral', 'positive', 'average'],  # row order matches the loop below
    'Default': [0, 0, 0, 0],
    'Default + Custom tokenizer': [0, 0, 0, 0],
    'Downsampled': [0, 0, 0, 0],
    'Downsampled + Custom tokenizer': [0, 0, 0, 0]
}
score_df = pd.DataFrame(score_df).set_index('Model')
for i_2, model in enumerate([sgd, sgd_2, sgd_ds, sgd_ds_2]):
    avg_score = 0
    for i_1, sentiment in enumerate([negative_sentences, neutral_sentences, positive_sentences]):
        predictions = model.predict(sentiment)
        # share of the 20 sentences that got the expected label, as a percentage
        score = (np.array(predictions == i_1).sum() / 20) * 100
        score_df.iloc[i_1, i_2] = score
        avg_score += score
    avg_score = round(avg_score / 3, 1)
    score_df.iloc[3, i_2] = avg_score
score_df
| | Default | Default + Custom tokenizer | Downsampled | Downsampled + Custom tokenizer |
|---|---|---|---|---|
| Model | | | | |
| negative | 80.0 | 100.0 | 75.0 | 95.0 |
| neutral | 45.0 | 50.0 | 40.0 | 65.0 |
| positive | 80.0 | 95.0 | 80.0 | 95.0 |
| average | 68.3 | 81.7 | 65.0 | 85.0 |
fig = plt.figure(figsize=(10, 6))
fig = sns.barplot(data=score_df.iloc[3], palette=sns.color_palette('Blues', 4))
fig.set_title('Score vs Model', fontweight='bold')
fig.set_ylim(0, 100)
plt.show()
IV - Conclusion¶
I'm very satisfied with the practical test results; the custom tokenizer works noticeably better in practice and had a real impact.
I think that after more tweaking it has the potential to be used on social media sites such as Instagram, Twitter, or YouTube.
Thank you for reading this project and I hope you enjoyed the whole process :]¶
@ Gracjan Pawลowski 2024